Abstract

Support vector machines and kernel methods are increasingly popular in genomics 
and computational biology, due to their good performance in real-world applications 
and strong modularity that makes them suitable to a wide range of problems, from 
the classification of tumors to the automatic annotation of proteins. Their ability to 
work in high dimension, to process non-vectorial data, and the natural framework 
they provide to integrate heterogeneous data are particularly relevant to various 
problems arising in computational biology. In this chapter we survey some of the most 
prominent applications published so far, highlighting the particular developments 
in kernel methods triggered by problems in biology, and mention a few promising 
research directions likely to expand in the future. 

1 INTRODUCTION 

Recent years have witnessed a dramatic evolution in many fields of life science with
the emergence and rapid spread of so-called high-throughput technologies, which
generate huge amounts of data to characterize various aspects of biological samples
or phenomena. To name just a few, DNA sequencing technologies have already
provided the whole genome of several hundreds of species, including the human genome
(Consortium, 2001; Venter, 2001); DNA microarrays (Schena et al., 1995), which
allow the monitoring of the expression level of tens of thousands of transcripts
simultaneously, opened the door to functional genomics, the elucidation of the functions
of the genes found in the genomes (DeRisi et al., 1997); recent advances in
ionization technology have boosted large-scale capabilities in mass spectrometry, and the
rapidly growing field of proteomics, focusing on the systematic, large-scale analysis
of proteins (Aebersold and Mann, 2003). As biology suddenly entered this new era
characterized by the relatively cheap and easy generation of huge amounts of data, 
the urgent need for efficient methods to represent, store, process, analyze, and fi- 
nally make sense out of these data triggered the parallel development of numerous 
data analysis algorithms in computational biology. Among them, kernel methods in 
general, and support vector machines (SVM) in particular, have quickly gained pop- 
ularity for problems involving the classification and analysis of high-dimensional or 
complex data. Half a decade after the first pioneering papers (Mukherjee et al., 1998;
Haussler, 1999; Jaakkola et al., 1999), these methods have been applied to a variety
of problems in computational biology, with more than 100 research papers published
in 2004 alone.¹ The main reasons behind this fast development, beyond the generally
good performance of SVM on real-world problems and the ease of use provided
by current implementations, involve (i) the particular capability of SVM to cope
with high-dimensional and noisy data, typically produced by various high-throughput
technologies, and (ii) the possibility to process non-vectorial data, such as biological 
sequences, protein structures or gene networks, and to easily fuse heterogeneous data 
thanks to the use of kernels. More than a mere application of well-established meth- 
ods to new datasets, the use of kernel methods in computational biology has been 
accompanied by new developments to match the specificities and the needs of the 
field, such as methods for feature selection in combination with the classification of 
high-dimensional data, the invention of string kernels to process biological sequences, 
or the development of methods to learn from several kernels simultaneously. In order 
to illustrate some of the most prominent applications of kernel methods in compu- 
tational biology and the specific developments they triggered, this chapter focuses 
on selected applications related to the manipulation of high-dimensional data, the 
classification of biological sequences, and a few less developed but promising applica- 
tions. This chapter is therefore not intended to be an exhaustive survey, but rather to 
illustrate with some examples why and how kernel methods have invaded the field of 
computational biology so rapidly. The interested reader will find more references in
the book by Schölkopf et al. (2004) dedicated to the topic. Several kernels for
structured data, such as sequences or trees, widely developed and used in computational
biology, are also presented in detail in the book by Shawe-Taylor and Cristianini
(2004).



2 CLASSIFICATION OF HIGH-DIMENSIONAL DATA

Several recent technologies, such as DNA microarrays, mass spectrometry or various 
miniaturized assays, provide thousands of quantitative parameters to characterize 
biological samples or phenomena. Mathematically speaking, the results of such ex- 
periments can be represented by high-dimensional vectors, and many applications 
involve the supervised classification of such data. Classifying data in high dimension 
with a limited number of training examples is a challenging task that most statistical 
procedures have difficulties dealing with, due in particular to the risk of overfitting 
the training data. The theoretical foundations of SVM and related methods,
however, suggest that their use of regularization allows them to better resist the curse
of dimensionality than other methods. SVM were therefore naturally tested on a variety
of datasets involving the classification of high-dimensional data, in particular for the 
analysis of tumor samples from gene expression data, and novel algorithms were de- 
veloped in the framework of kernel methods to select a few relevant features for a 
given high-dimensional classification problem. 

¹ A list of references is available at http://cg.ensmp.fr/~vert/svn/bibli/html/biosvm.html

2.1 Tumor Classification from Gene Expression Data

The early detection of cancer and prediction of cancer types from gene expression
data have been among the first applications of kernel methods in computational
biology (Mukherjee et al., 1998; Furey et al., 2000) and remain prominent. These
applications indeed have potentially important impacts on the treatment of cancers,
providing clinicians with objective and possibly highly accurate information
to choose the most appropriate form of treatment. In this context, SVM were
widely applied and compared with other algorithms for the supervised classification
of tumor samples from expression data, of typically several thousands of genes for
each tumor. Examples include the discrimination between acute myeloid and acute
lymphoblastic leukemia (Mukherjee et al., 1998), colon cancer and normal colon
tissues (Moler et al., 2000), normal ovarian, normal non-ovarian and cancer ovarian
tissues (Furey et al., 2000), melanoma, soft tissue sarcoma and clear cell sarcoma
(Segal et al., 2003b), different types of soft tissue sarcomas (Segal et al., 2003a), or
normal and gastric tumor tissues (Meireles et al.), to name just a few. Another
typical application is the prediction of the future evolution of a tumor, such as the
discrimination between relapsing and nonrelapsing Wilms tumors (Williams et al.,
2004), the prediction of metastatic or non-metastatic squamous cell carcinoma of the
oral cavity (O'Donnell et al., 2005), or the discrimination between diffuse large B-cell
lymphoma with positive or negative treatment outcome (Shipp et al., 2002).



The SVM used in these studies are usually linear hard-margin SVM, or linear
soft-margin SVM with a default C parameter value. Concerning the choice of the
kernel, several studies observe that nonlinear kernels tend to decrease performance
(Ben-Dor et al., 2000; Valentini, 2002) compared to the simplest linear kernel, which
is consistent with the intuition that the complexity of learning nonlinear functions in
very high dimension does not play in their favor. On the other hand, the choice of
hard-margin SVM, sometimes advocated as a default method when data are linearly
separable, is certainly worth questioning in more detail. Indeed, the theoretical
foundations of SVM suggest that in order to learn in high dimension, one should
rather increase the importance of regularization as opposed to fitting the data, which
corresponds to decreasing the C parameter of the soft-margin formulation. A few
recent papers indeed highlight the fact that the choice of C has an important effect
on the generalization performance of SVM for classification of gene expression data
(Huang and Kecman, 2005).

A general conclusion of these numerous studies is that SVM generally provide 
good classification accuracy in spite of the large dimension of the data. For example, 
in a comparative study of several algorithms for multi-class supervised classification,
including naive Bayes, k-nearest neighbors and decision trees, Li et al. (2004) note
that "[SVM] achieve better performance than any other classifiers on almost all the
datasets". However, it is fair to mention that other studies conclude that most
algorithms that take into account the problem of large dimension either through
regularization, or through feature selection, reach roughly similar accuracy on most
datasets (Ben-Dor et al., 2000). From a practical point of view, the use of the simplest
linear kernel and of the soft-margin formulation of SVM seems to be a reasonable 
default strategy for this application. 
2.2 Feature Selection



In the classification of microarray data, it is often important, for classification
performance, biomarker identification and interpretation of results alike, to select only a
few discriminative genes among the thousands of candidates available on a typical
microarray. While the literature on feature selection is older and goes beyond the field
of kernel methods, several interesting developments with kernel methods have been
proposed in recent years, explicitly motivated by the problem of gene selection
from microarray data.



For example, Su et al. (2003) propose to evaluate the predictive power of each
single gene for a given classification task by the value of the functional minimized
by a one-dimensional SVM, trained to classify samples from the expression of only 
the single gene of interest. This criterion can then be used to rank genes and select 
only a few with important predictive power. This procedure therefore belongs to the 
so-called filter approach to feature selection, where a criterion (here using SVM) to 
measure the relevance of each feature is defined, and only relevant features according 
to this criterion are kept. 

A second general strategy for feature selection is the so-called wrapper approach, 
where feature selection alternates with the training of a classifier. The now widely
used recursive feature elimination (RFE) procedure of Guyon et al. (2002), which
iteratively selects smaller and smaller sets of genes and trains SVM, follows this
strategy. RFE can only be applied with linear SVM, which is nevertheless not a 
limitation as long as many features remain, and works as follows. Starting from the 
full set of genes, a linear SVM is trained and the genes with the smallest weights 
in the resulting linear discrimination function are eliminated. The procedure is then 
repeated iteratively starting from the set of remaining genes, and stops when a desired 
number of genes is reached. 
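
The elimination loop just described can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of the original authors: the helper `train_linear_svm` (a plain subgradient solver for the regularized hinge loss, without bias term), the regularization value `lam`, the number of epochs and the toy data are all assumptions made for the example.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Hypothetical solver: subgradient descent on lam/2*|w|^2 + mean hinge loss."""
    d = len(X[0])
    w = [0.0] * d
    for t in range(1, epochs + 1):
        step = 1.0 / (lam * t)
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:  # only margin violators contribute to the hinge subgradient
                for j in range(d):
                    grad[j] -= yi * xi[j] / len(X)
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w


def rfe(X, y, n_keep, drop_per_round=1):
    """Return the indices of the n_keep features that survive elimination."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        # retrain the linear SVM on the remaining features only
        X_active = [[row[j] for j in active] for row in X]
        w = train_linear_svm(X_active, y)
        n_drop = min(drop_per_round, len(active) - n_keep)
        # eliminate the features with the smallest absolute weights
        order = sorted(range(len(active)), key=lambda j: abs(w[j]))
        dropped = set(order[:n_drop])
        active = [f for j, f in enumerate(active) if j not in dropped]
    return active


# Toy data (assumed): feature 0 carries the label, features 1 and 2 do not.
X = [[2.0, 0.1, 0.0], [2.0, -0.1, 0.0], [-2.0, 0.1, 0.0], [-2.0, -0.1, 0.0]]
y = [1, 1, -1, -1]
print(rfe(X, y, n_keep=1))
```

With thousands of genes one would typically eliminate many features per round (a larger `drop_per_round`) to limit the number of SVM trainings.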

Finally, a third strategy for feature selection, called embedded approach, com- 
bines the learning of a classifier and the selection of features in a single step. A 
kernel method following this strategy has been implemented in the joint classifier
and feature optimization (JCFO) procedure of Krishnapuram et al. (2004). JCFO is
roughly speaking a variant of SVM with a Bayesian formulation, in which sparseness 
is obtained both for the features and the classifier expansion in terms of kernel by 
appropriate choices of prior probabilities. The precise description of the complete 
procedure to train this algorithm, involving an expectation-maximization (EM) iter- 
ation, would go beyond the scope of this chapter and the interested reader is referred 
to the original publication for further practical details. 

Generally speaking, and in spite of these efforts to develop clever algorithms, 
the effect of feature selection on the classification accuracy of SVM is still debated. 
Although very good results are sometimes reported, for example for the JCFO
procedure (Krishnapuram et al., 2004), several studies conclude that feature selection,
for example with procedures like RFE, does not actually improve the accuracy of SVM
trained on all genes (Ambroise and McLachlan, 2002; Ramaswamy et al., 2001). The
relevance of feature selection algorithms for gene expression data is therefore
still an open research question, which practitioners should assess case by case.

2.3 Other High-Dimensional Data in Computational Biology

While early applications of kernel methods to high-dimensional data in genomics and
bioinformatics mainly focused on gene expression data, a number of other
applications have flourished more recently, some being likely to expand quickly as major
applications of machine learning algorithms. For example, studies focusing on tissue
classification from data obtained by other technologies, such as methylation assays,
to monitor the patterns of cytosine methylation in the upstream regions of genes
(Model et al., 2001), or array comparative genomic hybridization (CGH), to
measure gene copy number changes in hundreds of genes simultaneously (Aliferis et al.),
are starting to accumulate. A huge field of application that has so far barely caught
the interest of the machine learning community is proteomics, that is, the quantitative
study of the protein content of cells and tissues. Technologies such as tandem mass
spectrometry, to monitor the protein content of a biological sample, are now well
developed, and classification of tissues from these data is a future potential
application of SVM (Wu et al., 2004; Wagner et al., 2004). Applications in toxicogenomics
(Steiner et al.), chemogenomics (Bao and Sun, 2002; Bock and Gough, 2002)
and analysis of single nucleotide polymorphisms (Yoon et al., 2003; Listgarten et al.,
2004) are also promising applications for which the capacity of SVM to classify
high-dimensional data has only started to be exploited.

3 SEQUENCE CLASSIFICATION

The various genome sequencing projects have produced huge amounts of sequence 
data that need to be analyzed. In particular, the urgent need for methods to auto- 
matically process, segment, annotate and classify various sequence data has triggered 
the fast development of numerous algorithms for strings. In this context, the possi- 
bility offered by kernel methods to process any type of data, as soon as a kernel for 
the data to be processed is available, has been quickly exploited to offer the power 
of state-of-the-art machine learning algorithms to sequence processing. 

Problems that arise in computational biology consist in processing either sets 
of sequences of a fixed length, or sets of sequences with variable lengths. From a
technical point of view the two problems slightly differ: while there are natural ways 
to encode fixed-length sequences as fixed-length vectors, making them amenable to 
processing by most learning algorithms, manipulating variable-length sequences is 
less obvious. In both cases, many successful applications of SVM have been reported, 
combining ingenious developments of string kernels, sometimes specifically adapted 
to a given classification task, with the power of SVM. 

3.1 Kernels for Fixed-Length Sequences 

Problems involving the classification of fixed-length sequences appear typically when 
one wants to predict a property along a sequence, such as the local structure or solvent 
accessibility along a protein sequence. In that case, indeed, a common approach is to 
use a moving window, that is, to predict the property at each position independently
from the others, and to base the prediction only on the nearby sequence contained in a 
small window around the site of interest. More formally, this requires the construction 
of predictive models that take a sequence of fixed length as input to predict the 
property of interest, the length of the sequences being exactly the width of the 
window. 

To fix notations, let us denote by $p$ the common length of the sequences, and
by $x = x_1 \ldots x_p$ a typical sequence, where each $x_i$ is a letter from the alphabet,
e.g., an amino-acid. The most natural way to transform such a sequence into a vec- 
tor of fixed length is to first encode each letter itself into a vector of fixed length 
k, and then to concatenate the codes of the successive letters to obtain a vector 
of size pk for the whole sequence. A simple code for letters is the following so- 
called sparse encoding: denoting by a the size of the alphabet, the i-th letter of 
the alphabet is encoded as a vector of dimension a containing only zeros, except 
for the i-th dimension that is set to 1. For example, in the case of nucleotide
sequences with alphabet {A, C, G, T}, the codes for A, C, G and T would respectively be
(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1) and the code for the sequence of length 
3 AGT would be (1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1). Several more evolved codes for single 
letters have also been proposed. For example, if one has a prior matrix of pairwise 
similarities between letters, such as widely-used similarity matrices between amino- 
acids, it is possible to replace the 0/1 sparse encoding of a given letter by the vector of 
similarity with other letters; hence the A in the previous example could for instance 
be represented by the vector (1, 0, 0.5, 0) to emphasize one's belief that A and G share 
some similarity. This is particularly relevant for biological sequences where mutations 
of single letters to similar letters are very common. Alternatively, instead of using 
a prior matrix of similarity, one can automatically align the sequence of interest to 
similar sequences in a large sequence database, and encode each position by the
frequency of each letter in the alignment. As a trivial example, if our previous sequence
AGT was found to be aligned to the following sequences: AGA, AGC, CGT, ATT,
then it could be encoded by the vector (0.8, 0.2, 0, 0, 0, 0, 0.8, 0.2, 0.2, 0.2, 0, 0.6),
corresponding to the respective frequencies of each letter at each position among
the five sequences (AGT itself and the four sequences aligned to it).
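
The two encodings discussed in this paragraph can be checked with a short Python sketch. The function names `sparse_encode` and `profile_encode` are hypothetical, and the alignment is assumed to be gap-free with all sequences of the same length as the query. Note that the frequency vector quoted above is recovered only when the query AGT is counted together with its four aligned sequences.

```python
ALPHABET = "ACGT"


def sparse_encode(seq, alphabet=ALPHABET):
    """Concatenate one 0/1 indicator vector of size len(alphabet) per position."""
    vec = []
    for letter in seq:
        vec.extend(1 if letter == a else 0 for a in alphabet)
    return vec


def profile_encode(seq, aligned, alphabet=ALPHABET):
    """Encode each position by the letter frequencies over seq and its aligned sequences."""
    rows = [seq] + list(aligned)  # the query itself is part of the alignment
    vec = []
    for pos in range(len(seq)):
        column = [row[pos] for row in rows]
        vec.extend(column.count(a) / len(column) for a in alphabet)
    return vec


print(sparse_encode("AGT"))
print(profile_encode("AGT", ["AGA", "AGC", "CGT", "ATT"]))
```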

In terms of kernel, it is easy to see that the inner product between sparsely encoded 
sequences is the number of positions with identical letter. In this representation, any 
linear classifier, such as that learned by an SVM, associates a weight to each feature,
that is, to each letter at each position, and the score of a sequence is the sum of 
the scores of its letters. Such a classifier is usually referred to as a position-specific 
score matrix in bioinformatics. Similar interpretations can be given for other letter 
encodings. An interesting extension of these linear kernels for sequences is to raise 
them to some small power d; in that case, the dimension of the feature space used by 
kernel methods increases, and the new features correspond to all products of d original 
features. This is particularly appealing for the sparse encoding, because a product of 
d binary factors is a binary variable equal to 1 if and only if all factors are 1, meaning 
that the features created by the sparse encoding to the power d exactly indicate the 
simultaneous presence of up to d particular letters at d particular positions. The
trick of raising a linear kernel to some power is therefore a convenient way to create
a classifier for problems that involve the presence of several particular letters at
particular positions.
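
As a small check of this argument, the sketch below (function names are hypothetical) computes the linear kernel on sparse encodings as a count of matching positions, raises it to the power d, and compares the result with an explicit count of the product features, that is, of the ordered d-tuples of positions at which both sequences agree. Tuples may repeat a position, which is why the features involve up to d distinct letters.

```python
from itertools import product


def match_kernel(x, y):
    """Inner product of sparse encodings: number of positions with identical letters."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(x, y) if a == b)


def poly_match_kernel(x, y, d):
    """Linear kernel on sparse encodings raised to the power d."""
    return match_kernel(x, y) ** d


def count_joint_matches(x, y, d):
    """Explicit feature-space view: ordered d-tuples of positions that all match."""
    matches = [a == b for a, b in zip(x, y)]
    return sum(
        all(matches[i] for i in idx)
        for idx in product(range(len(x)), repeat=d)
    )


print(poly_match_kernel("AGTC", "AGTA", 2), count_joint_matches("AGTC", "AGTA", 2))
```

The explicit count costs time exponential in d, whereas the kernel evaluation stays linear in the sequence length, which is the practical point of the power trick.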

A first limitation of these kernels is that they do not contain any information 
about the order of the letters: they are for example left unchanged if the letters in 
all sequences are shuffled according to any given permutation. Several attempts to
include ordering information have been proposed. For example, Rätsch et al. (2005)
replace the local encoding of single letters by a local encoding of k consecutive letters;
Zien et al. (2000) propose an ingenious variant to the polynomial kernel in order to
restrict the feature space to products of features at nearby positions only.

A second limitation of these kernels is that the comparison of two sequences only
involves the comparison of features at identical positions. This can be problematic in
the case of biological sequences, where insertions or deletions of letters are common,
resulting in possible shifts within a window. This problem led Meinicke et al. (2004)
to propose a kernel which incorporates a comparison of features at nearby positions,
using the following trick: if a feature $f$ (e.g., binary or continuous) appears at position
$i$ in the first sequence, and a feature $g$ appears at position $j$ in the second sequence,
then the kernel between the two sequences is increased by $K_0(f, g) \exp(-\sigma (i - j)^2)$,
where $K_0(\cdot, \cdot)$ is a basic kernel between the features such as the simple product. When
$\sigma$ is chosen very large, then one recovers the classical kernels obtained by comparing
only identical positions ($i = j$); the important point here is that for smaller values of
$\sigma$, features can contribute positively even though they might be located at different
positions on the sequences.
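
The following Python sketch illustrates the idea on sparsely encoded letters. It assumes that the positional weight has the form exp(-sigma * (i - j)^2), which is one reading consistent with the limit behavior described above (large values of the parameter recover the comparison of identical positions only); the exact weighting of the original publication may differ, and the function name is hypothetical.

```python
import math


def shift_tolerant_kernel(x, y, sigma):
    """Sum over all position pairs (i, j) of K0(x_i, y_j) * exp(-sigma * (i - j)**2).

    K0 is taken to be the product of sparse letter codes, i.e. 1 if the two
    letters are identical and 0 otherwise (an assumption for this example).
    """
    total = 0.0
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            k0 = 1.0 if a == b else 0.0
            total += k0 * math.exp(-sigma * (i - j) ** 2)
    return total


# CGTA is ACGT shifted by one position: no letter matches at identical positions.
print(shift_tolerant_kernel("ACGT", "CGTA", sigma=50.0))  # essentially zero
print(shift_tolerant_kernel("ACGT", "CGTA", sigma=0.1))   # shifted matches contribute
```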

The applications of kernels for fixed-length sequences to solve problems in
computational biology are already numerous. For example, they have been widely used
to predict local properties along protein sequences using a moving window, such as
secondary structure (Hua and Sun, 2001a; Guermeur et al., 2004), disulfide bridges
involving cysteines (Passerini and Frasconi, 2004; Chen et al., 2004),
phosphorylation sites (Kim et al., 2004), interface residues (Yan et al., 2004; Res et al., 2005),
or solvent accessibility (Yuan et al., 2002). Another important field of application is
the annotation of DNA, using fixed-length windows centered on a candidate point of
interest as an input to a classifier to detect translation initiation sites (Zien et al.,
2000; Meinicke et al., 2004), splice sites (Degroeve et al., 2004; Rätsch et al., 2005)
or binding sites of transcription factors (O'Flanagan et al., 2005; Sharan and Myers,
2005). The recent interest in short RNA such as antisense oligonucleotides or small
interfering RNAs for sequence-specific knockdown of messenger RNAs has also
resulted in several works involving classification of such sequences, which typically have
a fixed length by nature (Camps-Valls et al., 2004; Teramoto et al., 2005). Another
important application field for these methods is immunoinformatics, including the
prediction of peptides that can elicit an immune response (Dönnes and Elofsson,
2002; Bhasin and Raghava, 2004), or the classification of immunoglobulins collected
from healthy or ill patients (Zavaljevski et al., 2002; Yu et al., 2005). In most of these
applications, SVM lead to comparable if not better prediction accuracy than
competing state-of-the-art methods such as neural networks.

3.2 Kernels for Variable-Length Sequences

Many problems in computational biology involve sequences of different lengths. For 
example, the automatic functional or structural annotation of genes found in se- 
quenced genomes requires the processing of amino-acid sequences with no fixed 
length. Learning from variable-length sequences is a more challenging problem than
learning from fixed-length sequences, because there is no natural way to transform
a variable-length string into a vector. For kernel methods, this issue boils down to
the problem of defining kernels for variable-length strings, a topic that has received
a lot of attention in the last few years and has given rise to a variety of ingenious
solutions summarized in this section. 

The most common approach to make a kernel for strings, as for many other types
of data, is to design explicitly a set of numerical features that can be extracted from
strings, and then to form a kernel as a dot product between the resulting feature
vectors. As an example, Leslie et al. (2002) represent a sequence by the vector of counts
of occurrences of all possible k-mers in the sequence, for a given integer k, effectively
resulting in a vector of dimension $a^k$, where $a$ is the size of the alphabet. For instance,
the sequence AACGTCACGAA over the alphabet {A, C, G, T} is represented by
the 16-dimensional vector (2, 2, 0, 0, 1, 0, 2, 0, 1, 0, 0, 1, 0, 1, 0, 0) for k = 2, where the
dimensions are the counts of occurrences of each 2-mer AA, AC, ..., TG, TT
lexicographically ordered. The resulting spectrum kernel between this sequence and the
sequence ACGAAA, defined as the linear product between the two 16-dimensional
representation vectors, is equal to 9. It should be noted that although the number of
possible k-mers easily reaches the order of several thousands as soon as k is equal to
3 or 4, classification of sequences by SVM in this high-dimensional space gives
fairly good results. A major advantage of the spectrum kernel is its fast computation;
indeed, the set of k-mers appearing in a given sequence can be indexed in linear time
in a trie structure, and the inner product between two vectors is linear with respect
to the non-zero coordinates, i.e., at most linear in the total lengths of the sequences.
Several variants to the basic spectrum kernel have also been proposed, including for
example kernels based on counts of k-mers appearing with up to m mismatches in
the sequences (Leslie et al., 2004).
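
The worked example of this paragraph can be reproduced with the Python sketch below. It indexes k-mers with a hash map (`collections.Counter`) rather than with the trie mentioned in the text, a substitution made here for brevity; the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count the occurrences of every k-mer appearing in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_vector(seq, k, alphabet="ACGT"):
    """Explicit feature vector, with k-mers in lexicographic order."""
    counts = kmer_counts(seq, k)
    return [counts["".join(kmer)] for kmer in product(alphabet, repeat=k)]


def spectrum_kernel(x, y, k):
    """Inner product of k-mer count vectors, visiting non-zero coordinates only."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    if len(cx) > len(cy):
        cx, cy = cy, cx  # iterate over the smaller index
    return sum(n * cy[kmer] for kmer, n in cx.items())


print(spectrum_vector("AACGTCACGAA", 2))
print(spectrum_kernel("AACGTCACGAA", "ACGAAA", 2))
```

The explicit vector has dimension $a^k$ and is only practical for small k; the kernel function never builds it.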



Another natural approach to vector representation for variable-length strings
is to replace each letter by one or several numerical features, such as
physico-chemical properties of amino-acids, and then to extract features from the resulting
variable-length numerical time series using classical signal processing techniques such
as Fourier transforms (Wang et al., 2004) or autocorrelation analysis (Zhang et al., 2003).
For example, if $h_1, \ldots, h_n$ denote $n$ numerical features associated to the
successive letters of a sequence of length $n$, then the autocorrelation function $r_j$ for
a given $j > 0$ is defined by

$$r_j = \frac{1}{n - j} \sum_{i=1}^{n-j} h_i h_{i+j}.$$

One can then keep a fixed number of these coefficients, for example $r_1, \ldots, r_J$, and
create a J-dimensional vector to represent each sequence.
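
A minimal Python sketch of this feature extraction follows. It assumes the normalization of each coefficient by the number of terms, n - j, and takes the per-letter numerical values as a plain list; the function name and the toy values are hypothetical.

```python
def autocorrelation_features(h, J):
    """Return [r_1, ..., r_J] with r_j = (1 / (n - j)) * sum_i h_i * h_{i+j}."""
    n = len(h)
    if J >= n:
        raise ValueError("J must be smaller than the sequence length")
    return [
        sum(h[i] * h[i + j] for i in range(n - j)) / (n - j)
        for j in range(1, J + 1)
    ]


# Sequences of different lengths are mapped to vectors of the same dimension J.
print(autocorrelation_features([1.0, 2.0, 3.0, 4.0], J=2))
print(autocorrelation_features([0.5, -1.0, 0.0, 2.0, 1.5, -0.5], J=2))
```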

A completely different approach for kernel design is to derive them from
probabilistic models. Indeed, before the interest in string kernels grew, a number of
ingenious probabilistic models had been defined to represent biological sequences
or families of sequences, including for example Markov and hidden Markov
models for protein sequences, or stochastic context-free grammars for RNA sequences
(Durbin et al., 1998). Several authors have therefore explored the possibility to use
such models to make kernels, starting with the seminal work of Jaakkola et al. (2000)
that introduced the Fisher kernel. The Fisher kernel is a general method to extract a
fixed number of features from any data $x$ for which a parametric probabilistic model
$P_\theta$ is defined. Here, $\theta$ represents a continuous d-dimensional vector of parameters for
the probabilistic model, such as transition and emission probabilities for a hidden
Markov model, and each $P_\theta$ is a probability distribution. Once a particular
parameter $\theta_0$ is chosen to fit a given set of objects, for example by maximum likelihood,
then a d-dimensional feature vector for each individual object $x$ can be extracted by
taking the gradient in the parameter space of the log-likelihood of the point:

$$\phi_{\theta_0}(x) = \nabla_\theta \log P_\theta(x) \big|_{\theta = \theta_0}.$$

The intuitive interpretation of this feature vector, usually referred to as the Fisher
score in statistics, is that it represents how changes in the d parameters affect the
likelihood of the point $x$. In other words, one feature is extracted for each parameter
of the model; the particularities of the data point are seen from the eyes of the
parameters of the probabilistic model. The Fisher kernel is then obtained as the
dot product of these d-dimensional vectors, possibly multiplied by the inverse of
the Fisher information matrix to render it independent of the parametrization of the
model.
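
To make the construction concrete, the sketch below computes Fisher scores for a deliberately simple model, chosen only because its gradient has a closed form: letters drawn independently with probabilities given by a softmax of logits $\theta$. This toy model, the function names, and the omission of the Fisher information matrix are assumptions of the example; the models used in practice are hidden Markov models. For this model the gradient of the log-likelihood with respect to the logit of letter a is the count of a in the sequence minus n times its probability.

```python
import math

ALPHABET = "ACGT"


def letter_probs(theta):
    """Softmax of the logits: letter probabilities of the i.i.d. model."""
    z = sum(math.exp(t) for t in theta)
    return [math.exp(t) / z for t in theta]


def fisher_score(seq, theta, alphabet=ALPHABET):
    """Gradient of log P_theta(seq) with respect to the logits theta."""
    probs = letter_probs(theta)
    n = len(seq)
    return [seq.count(a) - n * p for a, p in zip(alphabet, probs)]


def fisher_kernel(x, y, theta):
    """Dot product of Fisher scores (Fisher information normalization omitted)."""
    sx, sy = fisher_score(x, theta), fisher_score(y, theta)
    return sum(u * v for u, v in zip(sx, sy))


theta0 = [0.0, 0.0, 0.0, 0.0]  # uniform letter model, assumed already fitted
print(fisher_score("AACG", theta0))
print(fisher_kernel("AACG", "AAAT", theta0))
```

Sequences of different lengths are mapped to score vectors of the same dimension, one coordinate per model parameter.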

A second line of thought to make a kernel out of a parametric probabilistic model
is to use the concept of covariance kernels (Seeger, 2002), that is, kernels of the form:

$$K(x, x') = \int P_\theta(x) P_\theta(x') \, d\mu(\theta),$$

where $d\mu$ is a prior distribution on the parameter space. Here, the features
correspond to the likelihoods of the objects under all distributions of the probabilistic
model; objects are considered similar when they have large likelihoods under similar
distributions. An important difference with the kernels seen so far is that here, no
explicit extraction of finite-dimensional vectors can be performed. Hence for practical
applications one must choose probabilistic models that allow the computation of the
integral above. This was carried out by Cuturi and Vert (2005), who present a family of
variable-length Markov models for strings and an algorithm to perform the integral
over parameters and models at the same time, resulting in a string kernel with linear
complexity in time and memory with respect to the total length of the sequences.
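
The requirement that the integral be computable can be illustrated on a toy model far simpler than the variable-length Markov models cited above: binary strings whose letters are independent Bernoulli($\theta$) draws, with a uniform prior on $\theta$. Under these assumptions, which are made for this example only, the integral is a Beta integral with a closed form, and the sketch compares it with a numerical approximation. Function names are hypothetical.

```python
from math import factorial


def bernoulli_likelihood(x, theta):
    """P_theta(x) for a binary string x with i.i.d. Bernoulli(theta) letters."""
    return theta ** x.count("1") * (1.0 - theta) ** x.count("0")


def covariance_kernel_closed_form(x, y):
    """Integral of P_theta(x) * P_theta(y) over theta ~ Uniform(0, 1).

    Uses the Beta integral: int t^a (1 - t)^b dt = a! b! / (a + b + 1)!.
    """
    ones = x.count("1") + y.count("1")
    zeros = x.count("0") + y.count("0")
    return factorial(ones) * factorial(zeros) / factorial(ones + zeros + 1)


def covariance_kernel_numeric(x, y, steps=10000):
    """Midpoint-rule approximation of the same integral, for comparison."""
    h = 1.0 / steps
    return h * sum(
        bernoulli_likelihood(x, (i + 0.5) * h) * bernoulli_likelihood(y, (i + 0.5) * h)
        for i in range(steps)
    )


print(covariance_kernel_closed_form("1101", "011"))
print(covariance_kernel_numeric("1101", "011"))
```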

Alternatively, many probabilistic models for biological sequences, such as hidden 
Markov models, involve a hidden variable that is marginalized over to obtain the 
probability of a sequence, i.e., can be written as 



$$P(x) = \sum_h P(x, h).$$
For such distributions, Tsuda et al. (2002) introduced the notion of marginalized
kernel, obtained by marginalizing a kernel for the complete variable over the hidden
variable. More precisely, assuming that a kernel for objects of the form $(x, h)$ is
defined, the marginalized kernel for observed objects $x$ is given by

$$K(x, x') = \sum_{h, h'} K((x, h), (x', h')) \, P(h \mid x) \, P(h' \mid x').$$



In order to motivate this definition with a simple example, let us consider a hidden 
Markov model with two possible hidden states, to model sequences with two possible 
regimes, such as introns/exons in eukaryotic genes. In that case the hidden variable 
corresponding to a sequence x of length n is a binary sequence h of length n describing 
the states along the sequence. For two sequences x and x', if the correct hidden states 
h and h' were known, such as the correct decomposition into introns and exons, then it 
would make sense to define a kernel $K((x, h), (x', h'))$ taking into account the specific
decomposition of the sequences into two regimes; for example, the kernel for complete 
data could be a spectrum kernel restricted to the exons, i.e., to positions with a 
particular state. Because the actual hidden states are not known in practice, the 
marginalization over the hidden state of this kernel using an adequate probabilistic 
model can be interpreted as an attempt to apply the kernel for complete data by 
guessing the hidden variables. As with the covariance kernel, marginalized kernels
often cannot be expressed as inner products between feature vectors, and require
computational tricks to be computed. Several beautiful examples of such kernels for
various probabilistic models have been worked out, including hidden Markov models
for sequences (Tsuda et al., 2002; Vert et al., 2006), stochastic context-free grammars
for RNA sequences (Kin et al., 2002), or random walk models on graphs for molecular
structures (Kashima et al., 2004).
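The intron/exon example can be made concrete with a small sketch. Everything below is an assumption made for illustration: the two-state HMM parameters are invented, the kernel for complete data is taken to be a 1-spectrum (letter count) kernel restricted to exon positions, and the posterior P(h|x) is obtained by brute-force enumeration of all state paths, which is only feasible for very short sequences.

```python
from itertools import product

# Hypothetical two-state HMM (state 0 = "intron", state 1 = "exon").
# All probabilities are made-up illustration values.
START = {0: 0.5, 1: 0.5}
TRANS = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}
EMIT = {
    0: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    1: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def joint_prob(x, h):
    """P(x, h) for a sequence x and a hidden state path h of the same length."""
    p = START[h[0]] * EMIT[h[0]][x[0]]
    for t in range(1, len(x)):
        p *= TRANS[h[t - 1]][h[t]] * EMIT[h[t]][x[t]]
    return p


def posterior(x):
    """Return {h: P(h | x)} by enumerating all 2^n state paths."""
    joint = {h: joint_prob(x, h) for h in product((0, 1), repeat=len(x))}
    total = sum(joint.values())  # P(x) = sum over h of P(x, h)
    return {h: p / total for h, p in joint.items()}


def exon_counts(x, h):
    """Letter counts restricted to positions labelled as exon (state 1)."""
    counts = {}
    for letter, state in zip(x, h):
        if state == 1:
            counts[letter] = counts.get(letter, 0) + 1
    return counts


def complete_kernel(x, h, y, g):
    """Kernel for complete data: 1-spectrum kernel on exon positions only."""
    cx, cy = exon_counts(x, h), exon_counts(y, g)
    return sum(v * cy.get(a, 0) for a, v in cx.items())


def marginalized_kernel(x, y):
    """K(x, y) = sum over (h, g) of K((x, h), (y, g)) P(h | x) P(g | y)."""
    px, py = posterior(x), posterior(y)
    return sum(
        complete_kernel(x, h, y, g) * ph * pg
        for h, ph in px.items()
        for g, pg in py.items()
    )
```

For instance, `marginalized_kernel("ACGT", "GGCA")` averages the exon-restricted kernel over all guesses of the hidden states, weighted by their posterior probabilities. The enumeration is exponential in the sequence length; practical marginalized kernels for HMMs rely on dynamic programming instead.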



Following a different line of thought, Haussler (1999) introduced the concept
of convolution kernels for objects that can be decomposed into subparts, such as
sequences or trees. For example, the concatenation of two strings $x_1$ and $x_2$ results
in another string $x = x_1 x_2$. If two initial string kernels $K_1$ and $K_2$ are chosen, then
a new string kernel is obtained by convolution of the initial kernels following the
equation:

$$K(x, x') = \sum_{x = x_1 x_2,\; x' = x'_1 x'_2} K_1(x_1, x'_1)\, K_2(x_2, x'_2).$$



Here the sum is over all possible decompositions of x and x' into two concatenated
subsequences. The rationale behind this approach is that it allows the combination
of different kernels adapted to different parts of the sequences, such as introns/exons
or gaps/aligned residues in alignment, without knowing the exact segmentation of
the sequences. Besides proving that the convolution of two kernels is a valid kernel,
Haussler (1999) gives several examples of convolution kernels relevant for biological
sequences; for example, he shows that the joint probability P(x, x') of two
sequences under a pair HMM model is a valid kernel, under mild assumptions. This
work is extended by Vert et al. (2004), where a valid convolution kernel based on
the alignment of two sequences is proposed. This kernel, named local alignment
kernel, is a close relative of the widely used Smith-Waterman local alignment score
(Smith and Waterman, 1981), and gives excellent results on the problem of detecting
remote homologs of proteins.
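The convolution of two string kernels over all two-part decompositions can be sketched directly from its definition. The two base kernels below (an exact-match kernel and a constant kernel) are arbitrary choices made for illustration, and allowing empty parts in a decomposition is an assumption of this sketch.

```python
def exact_match(a, b):
    """Base kernel: 1 if the two strings are identical, 0 otherwise."""
    return 1.0 if a == b else 0.0


def constant(a, b):
    """Base kernel: always 1, i.e. this part of the string is ignored."""
    return 1.0


def splits(s):
    """All decompositions s = s1 + s2 (empty parts allowed in this sketch)."""
    return [(s[:i], s[i:]) for i in range(len(s) + 1)]


def convolution_kernel(x, y, k1=exact_match, k2=constant):
    """Sum of k1(x1, y1) * k2(x2, y2) over all decompositions of x and y."""
    return sum(
        k1(x1, y1) * k2(x2, y2)
        for x1, x2 in splits(x)
        for y1, y2 in splits(y)
    )
```

With these base kernels the convolution simply counts the prefixes shared by the two strings: `convolution_kernel("ACGT", "ACTT")` sums over the common prefixes "", "A" and "AC". The naive double loop costs a number of base-kernel evaluations proportional to the product of the sequence lengths.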

Finally, another popular approach to design features and therefore kernels for
biological sequences is to "project" them onto a fixed dictionary of sequences or motifs,
using classical similarity measures, and to use the resulting vector of similarities
as feature vector. For example, Logan et al. (2001) represent each sequence by a
10,000-dimensional vector indicating the presence of 10,000 motifs of the BLOCKS
database; similarly, Ben-Hur and Brutlag (2003) use a vector that indicates the presence
or absence of about 500,000 motifs in the eMOTIF database, requiring the use
of a trie structure to compute the kernel efficiently without explicitly storing the
500,000 features; and Liao and Noble (2003) represent each sequence by a vector of
sequence similarities with a fixed set of sequences.
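A minimal sketch of the dictionary-projection idea follows, assuming a tiny hypothetical motif dictionary written as regular expressions (these patterns are invented and are not taken from BLOCKS or eMOTIF). Each sequence is mapped to the set of motifs it contains, which is a sparse encoding of the binary presence/absence vector, and the kernel is the dot product of two such vectors.

```python
import re

# Hypothetical motif dictionary, for illustration only.
MOTIFS = ["C..C", "G[KR]S", "HE..H", "N[^P][ST]"]


def motif_features(seq, motifs=MOTIFS):
    """Indices of the motifs found in seq (sparse binary feature vector)."""
    return frozenset(i for i, m in enumerate(motifs) if re.search(m, seq))


def motif_kernel(x, y, motifs=MOTIFS):
    """Dot product of the binary vectors = number of motifs shared by x and y."""
    return len(motif_features(x, motifs) & motif_features(y, motifs))
```

This sketch scans every motif against every sequence, which would not scale to a dictionary of hundreds of thousands of motifs; the trie structure mentioned above addresses exactly that cost and is not reproduced here.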

These kernels for variable-length sequences have been widely applied, often in
combination with SVM, to various classification tasks in computational biology.
Examples include the prediction of protein structural or functional classes from their
primary sequence (Ding and Dubchak, 2001; Jaakkola et al., 2000; Vert et al., 2004;
Karchin et al., 2002; Cai et al., 2003), the prediction of the subcellular localization of
proteins (Hua and Sun, 2001b; Park and Kanehisa, 2003; Matsuda et al., 2005), the
classification of transfer RNA (Kin et al., 2002) and non-coding RNA (Karklin et al.,
2005), the prediction of pseudo-exons and alternatively spliced exons (Zhang et al.,
2003; Dror et al., 2005), the separation of mixed plant-pathogen EST collections
(Friedel et al., 2005), the classification of mammalian viral genomes (Rose et al.,
2005), or the prediction of ribosomal proteins (Lin et al., 2002).



This short review of kernels developed for the purpose of biological sequence clas- 
sification, besides highlighting the dynamism of research in kernel methods resulting 
from practical needs in computational biology, naturally raises the practical ques- 
tion of which kernel to use for a given application. Although no clear answer has 
emerged yet, some lessons can be learned from early studies. First, there is certainly 
no kernel universally better than others, and the choice of kernel should depend on 
the targeted application. Intuitively, a kernel for a classification task is likely to work 
well if it is based on features relevant to the task; for example, a kernel based on 
sequence alignments, such as the local alignment kernel, gives excellent results on 
remote homology detection problems, while a kernel based on the global content of 
sequences in short subsequences, such as the spectrum kernel, works well for the pre- 
diction of subcellular localization. Although some methods for systematic selection 
and combination of kernels are starting to emerge (see next section) , empirical eval- 
uation of different kernels on a given problem seems to be the most common way to 
choose a kernel. Another important point to notice, besides the classification accuracy
obtained with a kernel, is its computational cost. Indeed, practical applications often
involve datasets of thousands or tens of thousands of sequences, and the computational
cost of a method can become a critical factor in this context, in particular
in an online setting. The kernels presented above differ a lot in their computational
cost, ranging from fast linear-time kernels like the spectrum kernel, to slower kernels
like the quadratic-time local alignment kernel. The final choice of kernel for a
given application often results from a trade-off between classification performance
and computational burden.



4 OTHER APPLICATIONS AND FUTURE TRENDS

Besides the important applications mentioned in the previous sections, several other 
attempts to import ideas of kernel methods in computational biology have emerged 
recently. In this section we highlight three promising directions that are likely to 
develop quickly in the near future: the engineering of new kernels, the development 
of methods to handle multiple kernels, and the use of kernel methods for graphs in 
systems biology. 



4.1 More Kernels 



The power of kernel methods to process virtually any sort of data as soon as a 
valid kernel is defined has recently been exploited for a variety of data, besides 
high-dimensional data and sequences. For example, Vert (2002) derives a kernel for
phylogenetic profiles, that is, a signature indicating the presence or absence of each
gene in all sequenced genomes. Several recent works have investigated kernels for
protein 3D structures, a topic that is likely to expand quickly with the foreseeable
availability of predicted or solved structures for whole genomes (Dobson and Doig,
2005; Borgwardt et al., 2005). For smaller molecules, several kernels based on planar
or 3D structures have emerged, with many potential applications in computational
chemistry (Kashima et al., 2004; Mahé et al., 2005; Swamidass et al., 2005). This
trend to develop more and more kernels, often designed for specific data and applications,
is likely to continue in the future because it has proved to be a good approach
to obtain efficient algorithms for real-world applications. A nice by-product of these
efforts, which is still barely exploited, is the fact that any kernel can be used by
any kernel method, paving the way to a multitude of applications such as clustering
(Qin et al., 2003) or data visualization (Komura et al., 2005).



4.2 Integration of Heterogeneous Data 

Operations on kernels provide simple and powerful tools to integrate heterogeneous 
data or multiple kernels; this is particularly relevant in computational biology, where 
biological objects can typically be described by heterogeneous representations, and 
the availability of a large number of possible kernels for even a single representation 
raises the question of choice or combination of kernels. Suppose for instance that 
one wants to perform a functional classification of genes based on their sequences, 
expression over a series of experiments, evolutionary conservation, and position in 
an interaction network. A natural approach with kernel methods is to start by 
defining one or several kernels for each sort of data, that is, string kernels for the
gene sequences, vector kernels to process the expression profiles, etc. The apparent
heterogeneity of data types then vanishes as one simply obtains a family of kernel
functions $K_1, \ldots, K_p$. In order to learn from all data simultaneously, the simplest
approach is to define an integrated kernel as the sum of the initial kernels:

$$K = \sum_{i=1}^{p} K_i.$$

The rationale behind this sum is that if each kernel is a simple dot product, then
the sum of dot products is equal to the dot product of the concatenated vectors.
In other words, taking a sum of kernels amounts to putting all features of each
individual kernel together; if different features in different kernels are relevant for a
given problem, then one expects the kernel method trained on the integrated kernel to
pick those relevant features. This idea was pioneered by Pavlidis et al. (2002), where
gene expression profiles and gene phylogenetic profiles are integrated to predict the 
functional classes of genes, effectively integrating evolutionary and transcriptional 
information. 
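The equivalence between summing kernels and concatenating features can be checked numerically. In the sketch below, the two "views" of each gene (an expression profile and a phylogenetic profile) hold invented values, and both kernels are assumed to be plain dot products.

```python
def dot(u, v):
    """Linear kernel: plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sum_kernel(kernels, views_x, views_y):
    """Integrated kernel: sum over data types of K_i applied to the i-th view."""
    return sum(k(a, b) for k, a, b in zip(kernels, views_x, views_y))


# Two hypothetical genes, each described by (expression profile, phylogenetic profile).
gene1 = ([0.5, -1.0, 2.0], [1, 0, 1, 1])
gene2 = ([1.0, 0.5, -0.5], [1, 1, 0, 1])

# Sum of the two linear kernels, one per data type.
k_sum = sum_kernel([dot, dot], gene1, gene2)

# Single linear kernel on the concatenated feature vectors.
k_concat = dot(gene1[0] + gene1[1], gene2[0] + gene2[1])
```

Both `k_sum` and `k_concat` evaluate to the same number, which is the sense in which the integrated kernel puts all features together. With non-linear kernels the same reading holds in the corresponding feature spaces, even though the concatenated vector is never formed explicitly.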

An interesting generalization of this approach is to form a convex combination of 
kernels, of the form: 

$$K = \sum_{i=1}^{p} w_i K_i,$$

where the $w_i$ are nonnegative weights. Lanckriet et al. (2004) propose a general
framework, based on semidefinite programming, to optimize the weights and learn
a discrimination function for a given classification task simultaneously. Promising 
empirical results on gene functional classification show that by integrating several 
kernels, better results than each individual kernel can be obtained. 
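
To see why nonnegative weights are enough to keep the combination a valid kernel, one can write each kernel as a dot product in its own feature space. The feature maps $\Phi_i$ below are notation introduced here for illustration only:

```latex
\begin{align*}
K(x, x') &= \sum_{i=1}^{p} w_i K_i(x, x')
          = \sum_{i=1}^{p} w_i \langle \Phi_i(x), \Phi_i(x') \rangle \\
         &= \langle \Phi(x), \Phi(x') \rangle,
\qquad
\Phi(x) = \big( \sqrt{w_1}\, \Phi_1(x), \ldots, \sqrt{w_p}\, \Phi_p(x) \big).
\end{align*}
```

Under this reading, a weighted combination is again a concatenation of features, with the features of the i-th kernel rescaled by $\sqrt{w_i}$; a weight of zero removes the corresponding data type altogether.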

Finally, other kernel methods can be used to compare and search for correlations
between heterogeneous data. For example, Vert and Kanehisa (2003) propose to
use a kernelized version of canonical correlation analysis (CCA) to compare gene
expression data, on the one hand, with the position of genes in the metabolic network,
on the other hand. Each type of data is first converted into a kernel for genes, the
information about gene positions in the metabolic network being encoded with the
so-called diffusion kernel (Kondor and Vert, 2004). These two kernels define embeddings
of the set of genes into two Euclidean spaces, in which correlated directions are
detected by CCA. It is then shown that the directions detected in the feature space of 
the diffusion kernel can be interpreted as clusters in the metabolic network, resulting 
in a method to monitor the expression patterns of metabolic pathways. 



4.3 Kernel Methods in Systems Biology 

Another promising field of research where kernel methods can certainly contribute is 
systems biology, which roughly speaking focuses on the analysis of biological systems 
of interacting molecules, in particular biological networks. 

A first avenue of research is the reconstruction of biological networks from
high-throughput data. For example, the prediction of interacting proteins to reconstruct
the interaction network can be posed as a binary classification problem (given a
pair of proteins, do they interact or not?), and can therefore be tackled with SVM as
soon as a kernel between pairs of proteins is defined. As the primary data available
to make the interaction prediction are about each single protein, it is natural to try
to derive kernels for pairs of proteins from kernels for single proteins. This has been
carried out for example by Bock and Gough (2001), who characterize each protein by
a vector, and concatenate two such individual vectors to represent a protein pair.
Observing that there is usually no order in a protein pair, Martin et al. (2005) and
Ben-Hur and Noble (2005) propose to define a kernel between pairs (A, B) and (C, D)
by the equation:

$$K_p((A, B), (C, D)) = K_i(A, C)\, K_i(B, D) + K_i(A, D)\, K_i(B, C),$$

where $K_i$ denotes a kernel for individual proteins and $K_p$ the resulting kernel for pairs
of proteins. The rationale behind this definition is that in order to match the pair
(A, B) with the pair (C, D), one can either try to match A with C and B with D, or
to match A with D and B with C. Reported accuracies on the problem of protein
interaction prediction are very high, confirming the potential of kernel methods in 
this fast-moving field. 
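
The pairwise kernel can be sketched in a few lines. The kernel for individual proteins is assumed here to be a dot product on small integer feature vectors, which are invented for illustration; any valid kernel on single proteins could be passed instead.

```python
def dot(u, v):
    """Stand-in kernel for individual proteins: plain dot product."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_kernel(k, pair1, pair2):
    """K_p((A, B), (C, D)) = k(A, C) k(B, D) + k(A, D) k(B, C)."""
    (a, b), (c, d) = pair1, pair2
    return k(a, c) * k(b, d) + k(a, d) * k(b, c)


# Hypothetical feature vectors for four proteins.
A, B, C, D = [1, 0, 2], [0, 1, 1], [1, 1, 0], [2, 0, 1]

value = pairwise_kernel(dot, (A, B), (C, D))
swapped = pairwise_kernel(dot, (B, A), (C, D))
```

Because both ways of matching the two pairs are summed, `value` and `swapped` coincide: the kernel does not depend on the order in which the two proteins of a pair are listed.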

A parallel approach to network inference from genomic data has been investigated
by Yamanishi et al. (2004), who show that learning the edges of a network
can be carried out by first mapping the vertices, e.g., the genes, onto a Euclidean
space, and then connecting the pairs of points which are close to each other in this
embedding. The problem then becomes that of learning an optimal embedding of
the vertices, a problem known as distance metric learning that recently caught the
attention of the machine learning community and for which several kernel methods
exist (Vert and Yamanishi, 2005).
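
The second step of this approach, connecting points that are close in the embedding, can be sketched using only a kernel matrix, since squared distances in the feature space are given by K(i, i) + K(j, j) - 2 K(i, j). The sketch below does not learn the embedding, which is the hard part of the problem; the points, the linear kernel, and the distance threshold are all assumptions made for illustration.

```python
def gram_matrix(points):
    """Kernel matrix of a linear kernel on explicit points (illustration only)."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]


def kernel_distance_sq(K, i, j):
    """Squared distance between items i and j in the feature space of K."""
    return K[i][i] + K[j][j] - 2 * K[i][j]


def predict_edges(K, threshold):
    """Connect every pair whose squared feature-space distance is <= threshold."""
    n = len(K)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if kernel_distance_sq(K, i, j) <= threshold
    ]


# Four hypothetical genes already embedded in the plane.
K = gram_matrix([(0, 0), (0, 1), (5, 5), (5, 6)])
edges = predict_edges(K, threshold=2.0)
```

Here `edges` contains the two pairs of neighbouring points, (0, 1) and (2, 3). In a supervised setting the kernel matrix would come from an embedding fitted on the known part of the network rather than from raw coordinates.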

Finally, several other emerging applications in systems biology, such as inference on
networks (Tsuda and Noble, 2004) or classification of networks (Middendorf et al.,
2004), are likely to be subject to increasing attention in the future, due to the growing
interest and amount of data related to biological networks.



5 CONCLUSION 

This brief survey, although far from being complete, highlights the impressive ad- 
vances in the applications of kernel methods in computational biology in the last 5 
years. More than just importing well-established algorithms to a new application
domain, biology has triggered the development of new algorithms and methods, ranging
from the engineering of various kernels to the development of new methods for
learning from multiple kernels or for feature selection. The widespread diffusion of
easy-to-use SVM software, and the ongoing integration of various kernels and kernel
methods in major computing environments for bioinformatics, are likely to further
foster the use of kernel methods in computational biology, as long as they provide
state-of-the-art methods for practical problems. Many questions remain open,
regarding for example the automatic choice and integration of kernels, the possibility
to incorporate prior knowledge in kernel methods, and the extension of kernel
methods to more general kernels than positive definite ones, suggesting that theoretical
developments are also likely to progress quickly in the near future.